{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word2Vec 词向量\n",
    "将字词从文字空间映射到向量空间。  \n",
    "Map words from text space to vector space.\n",
    "\n",
    "每个字词都会有一个对应的代表其__语义__的向量。  \n",
    "Each word has a corresponding vector which represents the __semantics__ of a word.\n",
    "\n",
    "有多重方法做到这一点。  \n",
    "Many methods can achieve such a goal.\n",
    "\n",
    "有基于次数的方法。  \n",
    "There are count based methods.\n",
    "\n",
    "本教程介绍的是论文《[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)》中所使用的方法。这可以被视作为一种概率式方法。  \n",
    "This tutorial explains the method used in [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf). This can be seen as a probabilistic method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__为了最大地照顾到各种同学(不会中文的同学)，我主要用英文。关键句段会有中文。__  \n",
    "__For maximum fellow learners reach（non-Chinese learners）, I will mostly use English. I will provide Chinese for some key phrases.__\n",
    "\n",
    "__英文是在学界与开发者社区是一门重要的语言。中国的同学们多看英文资料也是有好处的。__  \n",
    "__English is an important languge in academia and developer communities. Fellow learners in China can benefit a lot by reading English materials.__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 预备知识 Pre-request:\n",
    "本教程是编程向的。2分理论，8分编程。  \n",
    "This tutorial is programming oriented. I will descripe it as 20% theory and 80% programming.\n",
    "\n",
    "所以，我假定你懂得基本的深度学习。  \n",
    "Therefore, I assume you already have understand the basics of __Deep Learning__.\n",
    "\n",
    "我也假定你有一定的TensorFlow基础。  \n",
    "I also assume that you have some experience with [__TensorFlow__](https://www.tensorflow.org/).\n",
    "\n",
    "我强烈建议你在继续本教程之前，阅读文末提供的链接。  \n",
    "I strongly encourage you to read all materials linked at the end of this notebook.\n",
    "\n",
    "#### To be specific, you ought to know:\n",
    "__1. One-hot encoding__  \n",
    "__2. Cross entropy loss__  \n",
    "__3. Maximum likelihood estimation__  \n",
    "__4. Softmax classifier__  \n",
    "__5. Fully connected neural network__  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A TensorFlow Example\n",
    "This is based on [TensorFlow website tutorial](https://www.tensorflow.org/tutorials/word2vec)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.0.0\n"
     ]
    }
   ],
   "source": [
    "### Python3\n",
    "\n",
    "# First let's import all the necessary dependencies\n",
    "import collections\n",
    "import math\n",
    "import os\n",
    "import random\n",
    "import zipfile\n",
    "\n",
    "import numpy as np\n",
    "import urllib\n",
    "import tensorflow as tf\n",
    "from tensorflow.contrib.tensorboard.plugins import projector\n",
    "\n",
    "from pandas import DataFrame\n",
    "\n",
    "# Check TensorFlow version. I encourage you to use 1.X. I use 1.0\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found and verified text8.zip\n",
      "Data size 17005207\n"
     ]
    }
   ],
   "source": [
    "# Step 1: Download the data.\n",
    "url = 'http://mattmahoney.net/dc/'\n",
    "\n",
    "# We want to download data from this url, if it's the first time\n",
    "def maybe_download(filename, expected_bytes):\n",
    "  \"\"\"\n",
    "  Download a file if not present, and make sure it's the right size.\n",
    "  \"\"\"\n",
    "  if not os.path.exists(filename):\n",
    "    filename, _ = urllib.request.urlretrieve(url + filename, filename)\n",
    "  statinfo = os.stat(filename)\n",
    "  if statinfo.st_size == expected_bytes:\n",
    "    print('Found and verified', filename)\n",
    "  else:\n",
    "    print(statinfo.st_size)\n",
    "    raise Exception('Failed to verify ' + filename + '. Can you get to it with a browser?')\n",
    "  return filename\n",
    "\n",
    "filename = maybe_download('text8.zip', 31344016)  # do not change this line\n",
    "\n",
    "\n",
    "# Read the data into a list of strings.\n",
    "def read_data(filename):\n",
    "  \"\"\"Extract the first file enclosed in a zip file as a list of words\"\"\"\n",
    "  with zipfile.ZipFile(filename) as f:\n",
    "    data = tf.compat.as_str(f.read(f.namelist()[0])).split()\n",
    "  return data\n",
    "\n",
    "# Now we have the data\n",
    "words = read_data(filename)\n",
    "print('Data size', len(words))  # The length of \"words\", not bytes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]\n",
      "Sample data [5241, 3081, 12, 6, 195, 2, 3134, 46, 59, 156] ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']\n"
     ]
    }
   ],
   "source": [
    "# Step 2: Build the dictionary and replace rare words with UNK token.\n",
    "\n",
    "# We only want 50000 words in the vocabulary just for the sake of demonstration\n",
    "vocabulary_size = 50000\n",
    "\n",
    "\n",
    "def build_dataset(words, vocabulary_size):\n",
    "  # UNK is a special token we use to represent \"empty word\", or thrown away words\n",
    "  # Since we want the most common 49999 words in the data\n",
    "  # we will throw away some less commonly used words\n",
    "  count = [['UNK', -1]]\n",
    "  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))\n",
    "  dictionary = dict()\n",
    "  for word, _ in count:\n",
    "    dictionary[word] = len(dictionary)\n",
    "  data = list()\n",
    "  unk_count = 0\n",
    "  for word in words:\n",
    "    if word in dictionary:\n",
    "      index = dictionary[word]\n",
    "    else:\n",
    "      index = 0  # dictionary['UNK']\n",
    "      unk_count += 1\n",
    "    data.append(index)\n",
    "  count[0][1] = unk_count\n",
    "  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n",
    "  return data, count, dictionary, reverse_dictionary\n",
    "\n",
    "data, count, dictionary, reverse_dictionary = build_dataset(words, vocabulary_size)\n",
    "\n",
    "# Python uses reference counting as its memory management method.\n",
    "# delete a name from a namespace will descrease the count of references to an object by one\n",
    "# since we only have 1 reference to the underlaying \"word\" object\n",
    "# deleting the name \"word\" will get rid of all references to that object\n",
    "# then this object will be freed from memory\n",
    "del words  # Hint to reduce memory.\n",
    "\n",
    "# Reference counting is different from gabage collection, which is used by JVM for example.\n",
    "\n",
    "# Gabage collection happens when the runtime schedules to do so.\n",
    "# When gabage collection happens is beyond the control of programmers.\n",
    "# But, in general, gabage collection is faster than reference counting.\n",
    "\n",
    "# Reference counting is slower because it happens every time when a refernece is created or deleted\n",
    "# But it offers programmers more control.\n",
    "\n",
    "print('Most common words (+UNK)', count[:5])\n",
    "print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])\n",
    "\n",
    "data_index = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3081 originated -> 5241 anarchism\n",
      "3081 originated -> 12 as\n",
      "12 as -> 6 a\n",
      "12 as -> 3081 originated\n",
      "6 a -> 12 as\n",
      "6 a -> 195 term\n",
      "195 term -> 2 of\n",
      "195 term -> 6 a\n"
     ]
    }
   ],
   "source": [
    "# Step 3: Function to generate a training batch for the skip-gram model.\n",
    "def generate_batch(batch_size, num_skips, skip_window):\n",
    "  global data_index\n",
    "  assert batch_size % num_skips == 0\n",
    "  assert num_skips <= 2 * skip_window\n",
    "  batch = np.ndarray(shape=(batch_size), dtype=np.int32)\n",
    "  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)\n",
    "  span = 2 * skip_window + 1  # [ skip_window target skip_window ]\n",
    "  buffer = collections.deque(maxlen=span)\n",
    "  for _ in range(span):\n",
    "    buffer.append(data[data_index])\n",
    "    data_index = (data_index + 1) % len(data)\n",
    "  for i in range(batch_size // num_skips):\n",
    "    target = skip_window  # target label at the center of the buffer\n",
    "    targets_to_avoid = [skip_window]\n",
    "    for j in range(num_skips):\n",
    "      while target in targets_to_avoid:\n",
    "        target = random.randint(0, span - 1)\n",
    "      targets_to_avoid.append(target)\n",
    "      batch[i * num_skips + j] = buffer[skip_window]\n",
    "      labels[i * num_skips + j, 0] = buffer[target]\n",
    "    buffer.append(data[data_index])\n",
    "    data_index = (data_index + 1) % len(data)\n",
    "  # Backtrack a little bit to avoid skipping words in the end of a batch\n",
    "  data_index = (data_index + len(data) - span) % len(data)\n",
    "  return batch, labels\n",
    "\n",
    "batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)\n",
    "for i in range(8):\n",
    "  print(batch[i], reverse_dictionary[batch[i]], '->', labels[i, 0], reverse_dictionary[labels[i, 0]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Step 4: Build and train a skip-gram model.\n",
    "batch_size = 128\n",
    "embedding_size = 128  # Dimension of the embedding vector.\n",
    "skip_window = 1       # How many words to consider left and right.\n",
    "num_skips = 2         # How many times to reuse an input to generate a label.\n",
    "\n",
    "# We pick a random validation set to sample nearest neighbors. Here we limit the\n",
    "# validation samples to the words that have a low numeric ID, which by\n",
    "# construction are also the most frequent.\n",
    "valid_size = 16     # Random set of words to evaluate similarity on.\n",
    "valid_window = 100  # Only pick dev samples in the head of the distribution.\n",
    "valid_examples = np.random.choice(valid_window, valid_size, replace=False)\n",
    "num_sampled = 64    # Number of negative examples to sample.\n",
    "\n",
    "\n",
    "graph = tf.Graph()\n",
    "with graph.as_default():\n",
    "\n",
    "  # Input data.\n",
    "  train_inputs = tf.placeholder(tf.int32, shape=[batch_size])\n",
    "  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])\n",
    "  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)\n",
    "\n",
    "  # Ops and variables pinned to the CPU because of missing GPU implementation\n",
    "  with tf.device('/cpu:0'):\n",
    "    # Look up embeddings for inputs.\n",
    "    embeddings = tf.Variable(\n",
    "      tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))\n",
    "    embed = tf.nn.embedding_lookup(embeddings, train_inputs)\n",
    "\n",
    "    # Construct the variables for the NCE loss\n",
    "    nce_weights = tf.Variable(\n",
    "      tf.truncated_normal([vocabulary_size, embedding_size],\n",
    "                          stddev=1.0 / math.sqrt(embedding_size)))\n",
    "    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))\n",
    "\n",
    "  # Compute the average NCE loss for the batch.\n",
    "  # tf.nce_loss automatically draws a new sample of the negative labels each\n",
    "  # time we evaluate the loss.\n",
    "  loss = tf.reduce_mean(\n",
    "    tf.nn.nce_loss(weights=nce_weights,\n",
    "                   biases=nce_biases,\n",
    "                   labels=train_labels,\n",
    "                   inputs=embed,\n",
    "                   num_sampled=num_sampled,\n",
    "                   num_classes=vocabulary_size))\n",
    "\n",
    "  # Construct the SGD optimizer using a learning rate of 1.0.\n",
    "  optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)\n",
    "\n",
    "  # Compute the cosine similarity between minibatch examples and all embeddings.\n",
    "  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))\n",
    "  normalized_embeddings = embeddings / norm\n",
    "  valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)\n",
    "  similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)\n",
    "\n",
    "  # Add variable initializer.\n",
    "  init = tf.global_variables_initializer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Step 5: Begin training.\n",
    "num_steps = 100001\n",
    "LOG_DIR = './log/'\n",
    "\n",
    "with tf.Session(graph=graph) as session:\n",
    "  # We must initialize all variables before we use them.\n",
    "  init.run()\n",
    "  print(\"Initialized\")\n",
    "\n",
    "  average_loss = 0\n",
    "  for step in xrange(num_steps):\n",
    "    batch_inputs, batch_labels = generate_batch(\n",
    "      batch_size, num_skips, skip_window)\n",
    "    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}\n",
    "\n",
    "    # We perform one update step by evaluating the optimizer op (including it\n",
    "    # in the list of returned values for session.run()\n",
    "    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)\n",
    "    average_loss += loss_val\n",
    "\n",
    "    if step % 2000 == 0:\n",
    "      if step > 0:\n",
    "        average_loss /= 2000\n",
    "      # The average loss is an estimate of the loss over the last 2000 batches.\n",
    "      print(\"Average loss at step \", step, \": \", average_loss)\n",
    "      average_loss = 0\n",
    "\n",
    "\n",
    "  \"\"\"\n",
    "  Use TensorBoard to visualize our model. \n",
    "  This is not included in the TensorFlow website tutorial.\n",
    "  \"\"\"\n",
    "  words_to_visualize = 3000\n",
    "  final_embeddings = normalized_embeddings.eval()[:words_to_visualize]\n",
    "  embedding_var = tf.Variable(final_embeddings)\n",
    "  session.run(embedding_var.initializer)\n",
    "  saver = tf.train.Saver([embedding_var])\n",
    "  saver.save(session, os.path.join(LOG_DIR, \"model.ckpt\"), 0)\n",
    "\n",
    "  # Format: tensorflow/contrib/tensorboard/plugins/projector/projector_config.proto\n",
    "  config = projector.ProjectorConfig()\n",
    "\n",
    "  # You can add multiple embeddings. Here we add only one.\n",
    "  embedding = config.embeddings.add()\n",
    "  embedding.tensor_name = embedding_var.name\n",
    "  # Link this tensor to its metadata file (e.g. labels).\n",
    "  embedding.metadata_path = os.path.join(LOG_DIR, 'metadata.tsv')\n",
    "\n",
    "  # Use the same LOG_DIR where you stored your checkpoint.\n",
    "  summary_writer = tf.summary.FileWriter(LOG_DIR)\n",
    "  summary_writer.add_graph(graph)\n",
    "\n",
    "  # The next line writes a projector_config.pbtxt in the LOG_DIR. TensorBoard will\n",
    "  # read this file during startup.\n",
    "  projector.visualize_embeddings(summary_writer, config)\n",
    "\n",
    "  # Write the metadata file to disk so that TensorBoard can find it later\n",
    "  labels = [(reverse_dictionary[i], i) for i in range(words_to_visualize)]\n",
    "  DataFrame(labels, columns=['word', 'freq_rank']).to_csv('log/metadata.tsv', index=False, sep='\\t')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A Gensim Example\n",
    "Besides the [TensorFlow example](word2vec.py), I show you a Gensim example here.\n",
    "\n",
    "[Gensim](https://radimrehurek.com/gensim/) is a topic modelling toolkit. It has word2vec implemented."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "### First, get the data\n",
    "### We use the complete works of Shakespear\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "### Paper\n",
    "[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)  \n",
    "[Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/pdf/1310.4546.pdf)  \n",
    "\n",
    "\n",
    "### Other Tutorials\n",
    "[A post by Chris McCormick](http://mccormickml.com/2016/04/27/word2vec-resources/)  \n",
    "[Second post by Chris McCormick](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/)  \n",
    "\n",
    "### TensorFlow\n",
    "[TensorFlow Tutorial](https://www.tensorflow.org/tutorials/word2vec)  \n",
    "[TensorFlow Example Code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py)  "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
